We present PyTorch Connectomics (PyTC), an open-source deep learning framework for the semantic and instance segmentation of volumetric microscopy images, built upon PyTorch. We demonstrate the effectiveness of PyTC in the field of connectomics, which aims to segment and reconstruct neurons, synapses, and organelles like mitochondria at nanometer resolution, in order to understand neuronal communication, metabolism, and development in animal brains. PyTC is a scalable and flexible toolbox that handles datasets at different scales and supports multi-task and semi-supervised learning to better exploit expensive expert annotations and the vast amount of unlabeled data during training. These functionalities can be easily realized in PyTC by changing the configuration options without coding, and adapted to other 2D and 3D segmentation tasks for different tissues and imaging modalities. Quantitatively, our framework achieves the best performance on the CREMI challenge for synaptic cleft segmentation (outperforming the existing best result by a relative 6.1%) and competitive performance on mitochondria and neuronal nuclei segmentation. Code and tutorials are publicly available at https://connectomics.readthedocs.io.
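As a hedged illustration of the multi-task idea mentioned in this abstract, the sketch below wires one shared 3D backbone to several task-specific heads in plain PyTorch. It deliberately avoids PyTC's actual configuration API; every module and variable name here is hypothetical.

```python
# Hypothetical sketch of multi-task volumetric segmentation, in the spirit of
# PyTC's multi-task learning; this is NOT the PyTC API, just an illustration.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """One shared 3D feature extractor feeding several task-specific heads."""
    def __init__(self, in_channels=1, feat=16):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_channels, feat, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        # One 1x1x1 head per target: binary foreground mask, instance contour.
        self.mask_head = nn.Conv3d(feat, 1, 1)
        self.contour_head = nn.Conv3d(feat, 1, 1)

    def forward(self, x):
        f = self.backbone(x)
        return {"mask": self.mask_head(f), "contour": self.contour_head(f)}

model = MultiTaskHead()
volume = torch.randn(2, 1, 32, 64, 64)          # (batch, channel, depth, H, W)
targets = {k: torch.randint(0, 2, (2, 1, 32, 64, 64)).float()
           for k in ("mask", "contour")}
preds = model(volume)
# Multi-task loss: a weighted sum of per-target BCE losses (uniform here).
loss = sum(nn.functional.binary_cross_entropy_with_logits(preds[k], targets[k])
           for k in preds)
loss.backward()
```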
Video instance segmentation aims to detect, segment, and track objects in a video. Current approaches extend image-level segmentation algorithms to the temporal domain, but this leads to temporally inconsistent masks. In this work, we identify mask quality degraded by missing detections as a performance bottleneck. Motivated by this, we propose a video instance segmentation method that alleviates the problems caused by missing detections. Since this cannot be solved with spatial information alone, we exploit temporal context using inter-frame attention. This allows our network to recover missing objects using box predictions from adjacent frames, thereby overcoming missing detections. Our method significantly outperforms previous state-of-the-art algorithms, achieving 35.1% mAP on the YouTube-VIS benchmark. Moreover, our method is fully online and requires no future frames. Our code is publicly available at https://github.com/anirudh-chakravarthy/objprop.
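A minimal sketch of the general inter-frame attention idea described above, assuming standard scaled dot-product attention between feature maps of adjacent frames; this is not the authors' implementation, and all names are illustrative.

```python
# Hypothetical sketch of inter-frame attention: features of the current frame
# query features of a neighboring frame, so objects missed in the current
# frame can be recovered from temporal context. Not the ObjProp code.
import torch

def inter_frame_attention(curr_feat, prev_feat):
    """curr_feat, prev_feat: (B, C, H, W) feature maps of adjacent frames."""
    B, C, H, W = curr_feat.shape
    q = curr_feat.flatten(2).transpose(1, 2)      # (B, HW, C) queries
    k = prev_feat.flatten(2).transpose(1, 2)      # (B, HW, C) keys
    v = k                                         # values = previous features
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, HW)
    out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
    return curr_feat + out   # residual: current features enriched with context

curr = torch.randn(1, 64, 16, 16)
prev = torch.randn(1, 64, 16, 16)
print(inter_frame_attention(curr, prev).shape)    # torch.Size([1, 64, 16, 16])
```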
Segmenting 3D cell nuclei from microscopy image volumes is critical for biological and clinical analyses, enabling the study of cellular expression patterns and cell lineages. However, current datasets for neuronal nuclei usually contain volumes smaller than $10^{-3}\ mm^3$ with fewer than 500 instances per volume, which cannot reveal the complexity of large brain regions and restricts the investigation of neuronal structures. In this paper, we push the task forward to the sub-cubic-millimeter scale and curate the NucMM dataset with two fully annotated volumes: a $0.1\ mm^3$ electron microscopy (EM) volume containing nearly an entire zebrafish brain with around 170,000 nuclei, and a $0.25\ mm^3$ micro-CT (uCT) volume containing part of a mouse visual cortex with about 7,000 nuclei. With two imaging modalities and significantly increased volume size and instance numbers, we discover great diversity in the appearance and density of neuronal nuclei, introducing new challenges to the field. We also perform a statistical analysis to illustrate these challenges quantitatively. To tackle them, we propose a novel hybrid-representation learning model that combines foreground masks, contour maps, and signed distance transforms to produce high-quality 3D masks. Benchmark comparisons on the NucMM dataset show that our proposed method significantly outperforms state-of-the-art nuclei segmentation approaches. Code and data are available at https://connectomics-bazaar.github.io/proj/nucmm/index.html.
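To make the hybrid representation concrete, here is a hedged post-processing sketch that decodes a foreground mask, a contour map, and a signed distance transform into instance labels via marker-controlled watershed. The decoding rule and all thresholds are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch: combine a predicted foreground mask, contour map, and
# signed distance transform (SDT) into 3D instance masks. Illustrative only.
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def decode_instances(fg_prob, contour_prob, sdt, fg_thr=0.5, seed_thr=0.5):
    """All inputs are (D, H, W) float arrays; returns an int32 label volume."""
    foreground = fg_prob > fg_thr
    # Seeds: confidently-interior voxels, away from contours and boundaries.
    seeds = foreground & (contour_prob < 0.5) & (sdt > seed_thr)
    markers, _ = ndimage.label(seeds)
    # Grow each seed over the foreground, descending the inverted SDT.
    return watershed(-sdt, markers, mask=foreground).astype(np.int32)

# Toy example with two separated blobs.
fg = np.zeros((8, 32, 32))
fg[2:6, 4:12, 4:12] = 1
fg[2:6, 20:28, 20:28] = 1
sdt = ndimage.distance_transform_edt(fg)
labels = decode_instances(fg, np.zeros_like(fg), sdt)
print(labels.max())   # 2 instances
```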
Guided depth super-resolution (GDSR) is an essential topic in multi-modal image processing, which reconstructs a high-resolution (HR) depth map from a low-resolution one captured under suboptimal conditions, with the help of an HR RGB image of the same scene. To address the challenges of explaining the working mechanism, extracting cross-modal features, and over-transferring RGB texture, we propose a novel Discrete Cosine Transform Network (DCTNet) that alleviates the problems from three aspects. First, a Discrete Cosine Transform (DCT) module reconstructs multi-channel HR depth features by using the DCT to solve a channel-wise optimization problem derived from the image domain. Second, we introduce a semi-coupled feature extraction module that uses shared convolutional kernels to extract common features and private kernels to extract modality-specific ones. Third, we employ an edge attention mechanism to highlight the contours informative for guided upsampling. Extensive quantitative and qualitative evaluations demonstrate the effectiveness of our DCTNet, which outperforms previous state-of-the-art methods with relatively few parameters. The code will be made publicly available.
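The semi-coupled idea (shared kernels for common structure, private kernels per modality) can be sketched in a few lines of PyTorch; the block below is a rough reading of that design under assumed channel splits, not the DCTNet code.

```python
# Hypothetical sketch of semi-coupled feature extraction: some convolutional
# kernels are shared across the RGB and depth branches (common structure) and
# some are private to each modality. Not the DCTNet implementation.
import torch
import torch.nn as nn

class SemiCoupledBlock(nn.Module):
    def __init__(self, channels=32, shared=16):
        super().__init__()
        private = channels - shared
        self.shared_conv = nn.Conv2d(channels, shared, 3, padding=1)  # one copy
        self.private_rgb = nn.Conv2d(channels, private, 3, padding=1)
        self.private_dep = nn.Conv2d(channels, private, 3, padding=1)

    def forward(self, rgb_feat, dep_feat):
        # The shared kernels see both modalities; private kernels see only one.
        rgb = torch.cat([self.shared_conv(rgb_feat), self.private_rgb(rgb_feat)], 1)
        dep = torch.cat([self.shared_conv(dep_feat), self.private_dep(dep_feat)], 1)
        return torch.relu(rgb), torch.relu(dep)

block = SemiCoupledBlock()
rgb, dep = block(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
print(rgb.shape, dep.shape)   # both torch.Size([1, 32, 64, 64])
```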
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
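As a minimal sketch of what a global photometric alignment step can look like, the snippet below shifts a source image's per-channel statistics toward target-domain statistics; this is a common recipe for reducing low-level domain gaps and an assumption here, not the paper's exact module.

```python
# Hypothetical global photometric alignment: renormalize source images with
# target-domain channel statistics. Illustrative only.
import torch

def photometric_align(src, tgt_mean, tgt_std, eps=1e-6):
    """src: (B, 3, H, W); tgt_mean/tgt_std: per-channel target-domain stats."""
    src_mean = src.mean(dim=(0, 2, 3), keepdim=True)
    src_std = src.std(dim=(0, 2, 3), keepdim=True)
    normalized = (src - src_mean) / (src_std + eps)
    return normalized * tgt_std.view(1, 3, 1, 1) + tgt_mean.view(1, 3, 1, 1)

src = torch.rand(4, 3, 128, 128)
aligned = photometric_align(src, tgt_mean=torch.tensor([0.4, 0.45, 0.5]),
                            tgt_std=torch.tensor([0.2, 0.2, 0.25]))
print(aligned.mean(dim=(0, 2, 3)))   # approximately the target means
```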
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
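A hedged sketch of one plausible way to fold per-artifact distortion maps and a saliency map into a single score; the weighting scheme and every name below are invented for illustration and do not reflect SSTAM's actual formulation.

```python
# Hypothetical saliency-weighted artifact score: artifacts in salient regions
# cost more than those in the background. Not the SSTAM metric.
import numpy as np

def saliency_weighted_score(artifact_maps, saliency, weights):
    """artifact_maps: dict name -> (H, W) map in [0, 1]; saliency: (H, W)."""
    saliency = saliency / (saliency.sum() + 1e-8)   # normalize to a weighting
    score = 0.0
    for name, amap in artifact_maps.items():
        score += weights[name] * float((amap * saliency).sum())
    return score

H, W = 64, 64
maps = {k: np.random.rand(H, W) for k in
        ("blurring", "blocking", "bleeding", "ringing", "flickering", "floating")}
sal = np.random.rand(H, W)
w = {k: 1.0 / 6 for k in maps}                      # uniform toy weights
print(saliency_weighted_score(maps, sal, w))
```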
Image virtual try-on aims at replacing the clothing on a person image with a garment image (in-shop clothes), which has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images; however, occlusion remains a pernicious effect for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two aspects: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth warps to an unreasonable body part. Based on the in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which can generate semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts a sharpened semantic parsing on the try-on person. Aided by semantics guidance and pose prior, textures of various complexity are selectively blended with human parts in a copy-and-paste manner. Then, the Generative Module (GM) is utilized to take charge of synthesizing the final try-on image and learning de-occlusion jointly. In comparison to the state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.
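A rough sketch of the copy-and-paste blending idea, assuming a parsing map with integer part labels and a texture image of the same size; the function, part ids, and blending rule are hypothetical and not the DOC-VTON module.

```python
# Hypothetical semantically-guided occlusion mixup: paste a cloth texture over
# selected human-part regions (given by a parsing map) to simulate occlusions
# for training. Illustrative only; not the DOC-VTON code.
import numpy as np

def semantic_mixup(image, parsing, texture, part_ids, alpha=1.0):
    """image/texture: (H, W, 3) float; parsing: (H, W) int part labels."""
    mask = np.isin(parsing, part_ids).astype(np.float32)[..., None]
    # Copy-and-paste blending restricted to the chosen semantic parts.
    return image * (1 - alpha * mask) + texture * (alpha * mask)

H, W = 64, 48
img = np.random.rand(H, W, 3)
parse = np.random.randint(0, 5, (H, W))     # toy parsing with 5 part labels
tex = np.random.rand(H, W, 3)
occluded = semantic_mixup(img, parse, tex, part_ids=[2, 3])  # e.g. arm labels
print(occluded.shape)                       # (64, 48, 3)
```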
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works utilize separated approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework named Panoptic-PartFormer. Moreover, we find that the previous metric PartPQ is biased toward PQ. To handle both issues, we make the following contributions: Firstly, we design a meta-architecture that decouples part features and things/stuff features, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Secondly, we propose a new metric Part-Whole Quality (PWQ) to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the errors for part segmentation and panoptic segmentation. Thirdly, inspired by Mask2Former and based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross attention scheme to further boost part segmentation quality, using masked cross attention as the part-whole interaction method. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost drop of 70% on GFlops and 50% on parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
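Masked cross attention in the Mask2Former style, which this paper builds on, can be sketched as follows; shapes and names are illustrative, not the Panoptic-PartFormer++ implementation.

```python
# Hypothetical masked cross attention: each query attends only to pixels
# inside its current predicted mask, restricting attention to the foreground.
import torch

def masked_cross_attention(queries, feats, masks):
    """queries: (B, Q, C); feats: (B, N, C) flattened pixels; masks: (B, Q, N) bool."""
    C = queries.shape[-1]
    logits = queries @ feats.transpose(1, 2) / C ** 0.5      # (B, Q, N)
    # Note: a query whose mask is entirely empty would need a fallback
    # (e.g. attend everywhere) to avoid NaNs; omitted for brevity.
    logits = logits.masked_fill(~masks, float("-inf"))
    attn = torch.softmax(logits, dim=-1)
    return attn @ feats                                      # (B, Q, C)

B, Q, N, C = 2, 8, 256, 64
q = torch.randn(B, Q, C)
f = torch.randn(B, N, C)
m = torch.rand(B, Q, N) > 0.3          # toy foreground masks per query
print(masked_cross_attention(q, f, m).shape)   # torch.Size([2, 8, 64])
```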
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from the videos, and illustrate that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
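The "separate spatial and temporal self-attention" pattern referenced above is a well-known factorization; a minimal sketch of it over a (frames x patches) token grid follows. It assumes standard multi-head attention and is not the MSTAT implementation.

```python
# Hypothetical divided spatial/temporal self-attention over video tokens:
# attend within each frame, then across time at each spatial location.
import torch
import torch.nn as nn

class DividedSTAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, T, P, C = x.shape
        s = x.reshape(B * T, P, C)                        # within each frame
        s, _ = self.spatial(s, s, s)
        s = s.reshape(B, T, P, C)
        t = s.permute(0, 2, 1, 3).reshape(B * P, T, C)    # across time
        t, _ = self.temporal(t, t, t)
        return t.reshape(B, P, T, C).permute(0, 2, 1, 3)  # back to (B, T, P, C)

x = torch.randn(2, 8, 49, 64)          # 8 frames, 7x7 patches, 64-dim tokens
print(DividedSTAttention()(x).shape)   # torch.Size([2, 8, 49, 64])
```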
This paper illustrates the technologies of user next intent prediction with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. Specifically, we propose AlipayKG to explicitly characterize user intent, which is an offline concept knowledge graph in the Life-Service domain modeling the historical behaviors of users, the rich content users interact with, and the relations between them. We further introduce a Transformer-based model that integrates expert rules from the knowledge graph to infer the online user's next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of the downstream tasks while retaining explainability.
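One plausible reading of "a Transformer-based model that integrates expert rules" is to mask the next-intent logits with a rule-derived candidate set; the sketch below illustrates only that reading, and every name and shape in it is an assumption, not the AlipayKG system.

```python
# Hypothetical next-intent prediction with knowledge-graph rules: a Transformer
# encodes the behavior sequence; intents disallowed by the KG rules for the
# current context are masked out of the logits.
import torch
import torch.nn as nn

class NextIntentModel(nn.Module):
    def __init__(self, n_intents=100, dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_intents, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, n_intents)

    def forward(self, behavior_seq, allowed_mask):
        """behavior_seq: (B, L) intent ids; allowed_mask: (B, n_intents) bool."""
        h = self.encoder(self.embed(behavior_seq))       # (B, L, dim)
        logits = self.out(h[:, -1])                      # predict from last step
        return logits.masked_fill(~allowed_mask, float("-inf"))

model = NextIntentModel()
seq = torch.randint(0, 100, (2, 10))
allowed = torch.rand(2, 100) > 0.5       # toy KG-derived candidate set
print(model(seq, allowed).shape)         # torch.Size([2, 100])
```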